ABSTRACT

The potential of item response theory (IRT) for
solving a number of testing problems in the Maryland Functional
Reading Program would appear to be substantial in view of the many
other promising applications of the theory. But it is well known
that the advantages derived from an IRT model cannot be achieved when
the fit between an item response model and the test data of interest
is less than adequate. The principal purpose of the research reported
in this paper was to investigate the fit of the one-, two-, and
three-parameter logistic models to the test results obtained from the
administration of the 1982 Maryland Functional Reading Test (MFRT).
The evidence addressing model-data fit seemed clear: a two-parameter
logistic model was able to adequately account for examinee
performance on the MFRT. The one-parameter model could not handle the
substantial variation among test items in their discriminating power.
The three-parameter model improved the fit only slightly because of
the minimal amount of guessing on the test. Several suggestions were
offered in the paper for conducting goodness-of-fit investigations.
(Author)
Fitting Item Response Models to the Maryland
Functional Reading Test Results 1, 2

Ronald K. Hambleton and Linda Murray 
University of Massachusetts, Amherst

and 

Paul Williams 
Maryland Department of Education 



The potential of item response theory (IRT) for solving a number of
testing problems in the Maryland Functional Reading Program would appear to
be substantial in view of the many other promising applications of the
theory (see, for example, Hambleton, 1983; Lord, 1980). But it is
well known that the advantages derived from an IRT model cannot be achieved
when the fit between an item response model and the test data of interest
is less than adequate. The principal purpose of the research reported in
this paper was to investigate the fit of the one-, two-, and
three-parameter logistic models to the test results obtained from the
administration of the 1982 Maryland Functional Reading Test.



1 A paper presented at the annual meeting of NCME, Montreal, 1983.

2 Laboratory of Psychometric and Evaluative Research Report No. 139.
Amherst, MA: School of Education, University of Massachusetts, 1983.



Method

Test Description and Use

In the Fall of 1982 the Maryland Functional Reading Test - Level II
was given to approximately 54,000 ninth graders. The Level II test
consisted of 75 operational items from five content domains. These five
areas, (1) Following Directions, (2) Locating Information, (3) Main Idea,
(4) Using Details, and (5) Understanding Forms, are the units used for
reporting diagnostic scores to teachers and parents. An overall test scale
score of 340 represents the passing standard. This test must be passed
before students are eligible for graduation from Maryland's public schools.
If the certification requirement is not met in the ninth grade, the local
school system is obligated by law to provide appropriate instructional
assistance before retesting the student yearly.



Sample 

For the purpose of our analyses, a 5% "spaced sample" was drawn from the
(approximately) 54,000 students who were administered the test in the Fall
of 1982. Specifically, every twentieth student from the master student file
was used. The resulting sample of 2662 students was sufficiently large to
carry out the logistic model analyses on the data.
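
The spaced sampling step can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name `spaced_sample`, the choice to start at the twentieth record, and the file size of 53,240 are all assumptions (the paper reports only "approximately 54,000" students and does not give the starting position).

```python
def spaced_sample(records, step=20):
    """Return every `step`-th record: the step-th, 2*step-th, and so on."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    # Index step-1 is the step-th record when counting from 1.
    return records[step - 1::step]


# Hypothetical master file of 53,240 student IDs (illustrative size only,
# chosen so that a 1-in-20 sample contains 2662 records).
master_file = list(range(1, 53241))
sample = spaced_sample(master_file, step=20)
print(len(sample), sample[:3])
```

A fixed step keeps the sample spread evenly across the master file, which is useful when the file is ordered by school or district.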

Analyses

The logistic model item and ability parameter estimates were obtained
from the computer program LOGIST (Wingersky, Barton, & Lord, 1982). Next,
the goodness of fit between the one-, two-, and three-parameter models and
the test data was addressed with residuals, specifically, standardized
residuals, using a computer program prepared by Hambleton and Murray
(1983). To obtain these standardized residuals, the ability scale was
divided into 12 equal intervals between ability scores of -3.0 and +3.0.
In each interval the difference between the actual item performance
(p-value) of the examinees and the expected item performance obtained from
the estimated item characteristic curve (ICC) was divided by the standard
error associated with the p-value to obtain a standardized residual (SR).
An SR was obtained at each ability level for each test item. Since the
direction of the differences was unimportant for many of the analyses,
absolute-valued standardized residuals were typically used.
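
The residual computation described above can be sketched in a few lines. This is not the Hambleton and Murray (1983) program; it is a minimal illustration under stated assumptions: the logistic ICC uses the conventional scaling constant D = 1.7, the expected p-value is evaluated at the interval midpoint, and the standard error is taken as sqrt(E(1 - E)/n), with E the expected p-value and n the number of examinees in the interval. All function and parameter names are hypothetical.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption here)


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic ICC; c=0 gives the two-parameter model,
    and c=0 with a common a for all items gives the one-parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def standardized_residuals(thetas, responses, a, b, c=0.0,
                           lo=-3.0, hi=3.0, n_bins=12):
    """Return {interval midpoint: SR} for one item.

    thetas    -- ability estimates, one per examinee
    responses -- 0/1 item scores, one per examinee
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    correct = [0] * n_bins
    for theta, u in zip(thetas, responses):
        if lo <= theta < hi:
            k = min(int((theta - lo) / width), n_bins - 1)
            counts[k] += 1
            correct[k] += u
    residuals = {}
    for k in range(n_bins):
        if counts[k] == 0:
            continue  # no examinees in this interval, so no residual
        mid = lo + (k + 0.5) * width
        expected = icc(mid, a, b, c)
        observed = correct[k] / counts[k]
        se = math.sqrt(expected * (1.0 - expected) / counts[k])
        residuals[mid] = (observed - expected) / se
    return residuals


# Toy check: 100 examinees at theta = 0.25, 60 of whom answer correctly,
# on an item whose model-predicted probability at 0.25 is 0.50.
thetas = [0.25] * 100
responses = [1] * 60 + [0] * 40
print(standardized_residuals(thetas, responses, a=1.0, b=0.25))
```

In the toy check the observed p-value (.60) exceeds the expected value (.50) by two standard errors (.05 each), so the SR at the 0.25 midpoint is about 2.0.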

Results

The classical item analysis results and the absolute-valued
standardized residuals (SRs) obtained with the three logistic models are
reported in Table 1. Three comments based on a study of Table 1 seemed
appropriate. First, and not surprising since the test was measuring
competencies that many students were expected to be masters of, the average
item performance was high (77.8% correct). This finding suggested that the
"pseudo-chance level" parameter in the three-parameter model was apt to be
of limited value in fitting a model to the data since guessing was an
insignificant factor in test performance. Second, while some of the
variation in the biserial correlations was due to the instability of these
statistics with very easy items, there seemed to be a rather substantial
variation in the discriminating power of test items. The biserial
correlations varied from .15 to slightly over 1 (it is possible to obtain
biserial correlations over 1). This preliminary finding suggested that the
two-parameter model would probably fit the test data better than the
one-parameter model. Third, a cursory analysis of the SRs in Table 1
showed, in fact, that the two- and three-parameter models produced highly
comparable fits to the test data and, on the average, better fits to the
data than the one-parameter model. The minor reversals in the SR values
were due to problems in parameter estimation and the tendency of SRs to
"blow up" in the low and high ability categories where the standard errors
were often very small.



Table 1

Maryland Functional Reading Test Item Statistics
(N=2662; 1982)

                                                  Absolute-Valued
Test    Proportion   Biserial      Content        Standardized Residuals
Item    Correct      Correlation   Category       1-p      2-p      3-p

  1       .97          .74            1            .92      .57      .62
  2       .95          .59            1            .62      .64      .81
  3       .88          .30            1           2.52      .84      .72
  4       .91          .70            1           1.18      .80      .73
  5       .94          .66            1            .83      .91      .61
  6       .45          .36            1           2.87     1.70     1.35
  7       .83          .59            1            .84      .61      .62
  8       .94          .77            1           1.28      .79      .61
  9       .73          .35            1           2.67     1.12     1.18
 10       .88          .55            1            .61      .64      .59
 11       .89          .34            1           2.00      .64      .76
 12       .93          .70            1           1.04      .81      .83
 13       .98          .67            1            .66      .75      .73
 14       .79          .44            1           1.65      .70      .77
 15       .86          .58            1            .88     1.29      .96
 16       .78           --            1           2.38      .95      .68
 17       .91          .72            1           1.07      .67      .61
 18       .74          .35            2           2.61      .62      .53
 19       .90          .44            2           1.37      .89      .69
 20       .95          .52            2            .69      .79      .48
 21       .98          .67            2            .58      .59      .41
 22       .93          .72            2           1.10      .74      .62
 23       .79          .50            2           1.17      .63      .73
 24       .87          .68            2           1.67      .97      .98
 25       .86          .65            2           1.09      .89      .83
 26       .57           --            2             --       --       --
 27       .83          .55            2           1.41       --       --
 28       .84          .59            2            .66      .67      .53
 29       .88          .70            2           1.37     1.01     1.05
 30       .89          .77            2             --       --       --
 31       .97          .80            2             --       --       --
 32       .88           --            2             --       --       --
 33       .87          .68            2           1.34     1.04     1.10
 34       .55          .44            2           2.00      .69      .89
 35       .59           --            3             --       --       --
 36       .75          .54            3             --       --       --
 37       .70          .60            3             --       --       --
 38       .23          .20            3           4.42      .65      .90
 39       .71          .73            3           2.49     1.92     1.13
 40       .71          .56            3             --       --       --
 41       .57          .43            3           1.98       --      .94
 42       .69          .62            3           1.51       --      .88
 43       .55          .46            3           1.27      .89     1.03
 44       .56          .52            3           1.86     1.51     1.40
 45       .54          .60            3             --     1.59       --
 46       .70          .62            3           1.50       --      .97
 47       .79          .70            4           1.57       --       --
 48       .85          .65            4           1.45     1.22      .85
 49        --           --            4           2.09      .80      .93
 50       .93         1.03            4           2.92     1.09     1.02
 51       .79          .68            4           1.06      .84      .83
 52       .95          .98            4           2.11      .93      .81
 53       .69          .62            4           1.20      .79      .86
 54       .88          .66            4            .81      .81      .65
 55       .94          .95            4           2.19      .87      .90
 56       .87          .63            4            .92     1.02     1.05
 57       .93          .91            4           2.15      .78      .71
 58       .76          .83            4           1.19     1.15     1.00
 59       .71          .51            4           1.35     1.41     1.37
 60       .73          .62            4           1.13      .79      .83
 61       .74          .51            4           3.69     1.62     1.53
 62       .31          .15            4           5.73     1.23      .94
 63       .73          .55            4             --      .99       --
 64       .89          .76            5             --       --       --
 65        --          .55            5            .72      .90      .98
 66        --          .41            5           2.73     1.83     1.85
 67       .71          .54            5             --       --       --
 68        --           --            5           1.61      .84     1.05
 69       .91          .94            5           2.72      .84      .95
 70       .78          .67            5           1.09      .65      .59
 71       .79          .70            5           1.34      .72      .69
 72       .29          .36            5           2.00      .52      .55
 73       .78          .66            5            .97      .70      .96
 74       .75          .61            5            .57      .66      .82
 75       .73          .65            5           1.29      .71      .84

Content categories: 1=Following Directions, 2=Locating Information,
3=Main Ideas, 4=Using Details, 5=Understanding Forms.

Note. -- marks an entry that is illegible in the source copy.
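
The remark in the Results that biserial correlations can exceed 1 (item 50 in Table 1 is reported at 1.03) follows from the way the statistic is computed. The sketch below uses the standard textbook formula, r_bis = ((M1 - M0)/s) * (pq/y), where M1 and M0 are the mean total scores of examinees passing and failing the item, s is the standard deviation of total scores, p is the proportion passing, q = 1 - p, and y is the ordinate of the standard normal density at the point that cuts off proportion p. The data are invented for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def biserial(item_scores, total_scores):
    """Biserial correlation between a 0/1 item and a total test score."""
    p = mean(item_scores)  # proportion answering the item correctly
    q = 1.0 - p
    m1 = mean(t for u, t in zip(item_scores, total_scores) if u == 1)
    m0 = mean(t for u, t in zip(item_scores, total_scores) if u == 0)
    s = pstdev(total_scores)
    # Ordinate of the standard normal density at the cut point for p.
    z = NormalDist()
    y = z.pdf(z.inv_cdf(p))
    return (m1 - m0) / s * (p * q / y)


# Invented data: a very easy item (p = .90) missed only by one examinee
# whose total score is far below everyone else's.
item = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
total = [70, 71, 72, 73, 74, 75, 69, 70, 72, 30]
print(round(biserial(item, total), 2))
```

With a very easy item the ratio pq/y is large relative to the point-biserial weighting, so a strongly skewed score distribution can push the statistic above 1. This is consistent with the instability of biserial correlations for very easy items noted in the Results.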

Table 2 provides the results of a more thorough analysis of the
standardized residuals. When an item response model fits a set of test
data, the standardized residuals should be distributed approximately
normally. In fact, the distributions of the standardized residuals obtained
with the two- and three-parameter logistic models were approximately
normal. These results are especially interesting because some preference
was given in test development to items that fit the one-parameter model.
Unfortunately, the goodness of fit studies carried out in the test
development stage were done with the BICAL program. But it is now
well known that the goodness of fit tests in the computer program have
problems (Divgi, 1981; van den Wollenberg, 1982). With the one-parameter
model, about 30% of the SRs exceeded an absolute value of 2.0 whereas only
about 5% would have been predicted had the model fit the test data.
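
The "about 5%" benchmark, and the corresponding benchmarks for the other intervals reported in Table 2, follow directly from the standard normal distribution. The short sketch below computes them; the function name is hypothetical, and the comparison assumes the SRs are standard normal when the model fits.

```python
from statistics import NormalDist


def expected_abs_z_percent(lower, upper=None):
    """Percent of standard normal values with lower < |z| <= upper.

    upper=None means no upper limit.
    """
    z = NormalDist()
    upper_cdf = 1.0 if upper is None else z.cdf(upper)
    return 200.0 * (upper_cdf - z.cdf(lower))


for label, lo, hi in [("0 to 1", 0, 1), ("1 to 2", 1, 2),
                      ("2 to 3", 2, 3), ("over 3", 3, None)]:
    print(label, round(expected_abs_z_percent(lo, hi), 1))
print("over 2", round(expected_abs_z_percent(2), 1))
```

Under normality roughly 68%, 27%, 4%, and 0.3% of the absolute-valued SRs would fall in the four intervals of Table 2, and about 4.6% would exceed 2.0. The one-parameter percentages (42.6, 27.8, 15.0, 14.6) depart sharply from these values, while the two- and three-parameter percentages are much closer.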

Table 3 provides information pertaining to the fit of the three 
logistic models in 12 ability categories. With respect to bias as 
reflected in the average standardized residuals, the statistics from the 
three models were similar although the two- and three-parameter models 
produced slightly less bias in accounting for the data. With respect to 
overall fit, as reflected in the average absolute-valued standardized 
residuals, again, the two- and three-parameter models provided better fits 
to the data. Regardless of the ability level, the fits were substantially 
better with the more general models. 
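
As a check on how the Table 3 totals are formed, the average absolute-valued SRs at the eleven ability levels can be averaged without weighting by the number of examinees. The sketch below assumes the "Total (unweighted)" column is this simple mean, which the reported values appear to be consistent with; the dictionary layout is only for illustration.

```python
from statistics import mean

# Average absolute-valued SRs at the eleven ability levels (Table 3).
avg_abs_sr = {
    "1-P": [1.70, 1.90, 2.05, 1.56, 1.53, 1.31, 1.57, 2.26, 1.75, 1.37, 0.68],
    "2-P": [1.22, 1.06, 1.19, 0.72, 1.07, 1.01, 0.70, 0.97, 0.93, 0.94, 0.76],
    "3-P": [0.98, 1.07, 1.11, 0.68, 0.97, 0.98, 0.64, 0.85, 0.84, 0.93, 0.72],
}

# Unweighted total: every ability level counts equally, however many
# examinees fall in it.
totals = {model: round(mean(values), 2) for model, values in avg_abs_sr.items()}
print(totals)
```

The computed totals agree with the reported 1.61, .96, and .89. Because the total is unweighted, the sparsely populated extreme ability levels influence it as much as the heavily populated middle levels.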

Table 2

Analysis of the Absolute-Valued Standardized Residuals
With Three Logistic Test Models for the MFRT 1

Logistic     Percent of Absolute-Valued Standardized Residuals
Model        |0 to 1|     |1 to 2|     |2 to 3|     |over 3|

   1           42.6         27.8         15.0         14.6
   2           60.6         29.7          7.3          2.4
   3           63.3         29.6          6.0          1.1

1 Total number of residuals is 825.



Table 3

Analysis of Standardized Residuals at Eleven Ability Levels with the One-,
Two- and Three-Parameter Logistic Models for the MFRT
(N=2662; 75 items)

Standardized                                          Ability Level
Residual       Logistic                                                                              Total
Statistic      Model    -2.75  -2.25  -1.75  -1.25   -.75   -.25    .25    .75   1.25   1.75   2.25  (unweighted)

Number            1        25     51    116    218    409    456     --     --     --    117     --
of                2        --     --     --     --    429    531     --     --     --    116     --
Examinees         3        22     50    100    224    406    528    491    387    228    117     49

Average           1       .40    .30    .28    .28    .39    .30   -.02    .20    .27    .40    .38     .29
Standardized      2       .39    .38    .40    .29    .17    .01   -.05   -.04    .18    .33    .36     .22
Residual          3       .12    .31    .29    .28    .24    .09     --   -.05    .08    .34    .30     .18

Average           1      1.70   1.90   2.05   1.56   1.53   1.31   1.57   2.26   1.75   1.37    .68    1.61
Absolute-         2      1.22   1.06   1.19    .72   1.07   1.01    .70    .97    .93    .94    .76     .96
Valued            3       .98   1.07   1.11    .68    .97    .98    .64    .85    .84    .93    .72     .89

Note. -- marks an entry that is illegible in the source copy.



One of the two main assumptions of the three logistic test models is
that of unidimensionality. One check on the validity of the assumption has
to do with the pattern of residuals for test items classified by content.
Test items within a content category may show a different pattern of
residuals if they "tap" a different trait from the one measured by the
items in the other content categories. Alternately, with the one-parameter
model, a different pattern of residuals may also indicate the subset of
test items has a relatively high or low average discriminating power in
relation to the remainder of the items in the test although the items may
measure the same trait as the other items in the test. Such an explanation,
however, cannot explain the results with the two- or three-parameter model
since variation in item discriminating power can be handled by the models.
A study of the statistics in Table 4 suggests that the "main idea domain"
of test items may be tapping a separate trait from the remaining test
items. A more careful review of the test suggests that this hypothesis may
be reasonable since the 12 "main idea" items appear to be "tapping" reading
comprehension whereas the other four content domains appear to be measuring
study skills. The three-parameter model fits the 12 items by assigning
"low discriminating powers" to these items and thereby reducing the
importance of these items to the total test scores and corresponding
ability estimates. But this strategy of handling "deviant" items is
undesirable too. In subsequent work with the test, more attention should
be focused on the unidimensionality assumption and ways for proceeding when
the assumption is violated to a substantial degree.
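
The chi-square statistics reported with Table 4 can be reproduced from the table itself. The sketch below is an illustration under one assumption: the item counts are recovered by applying each reported percentage to the number of items in the content category (for example, 41.2% of 17 items is 7 items). The function name is hypothetical.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a
    two-way table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: Following Directions, Locating Information, Main Idea,
# Using Details, Understanding Forms.
# Columns: items with SR <= 1.0, items with SR > 1.0.
counts = {
    "1-P": [[7, 10], [4, 13], [0, 12], [2, 15], [3, 9]],
    "2-P": [[14, 3], [14, 3], [2, 10], [10, 7], [10, 2]],
    "3-P": [[15, 2], [14, 3], [5, 7], [13, 4], [9, 3]],
}
for model, table in counts.items():
    chi2, df = pearson_chi_square(table)
    print(model, round(chi2, 2), df)
```

The recovered counts sum to the column totals given in the table heading (16 and 59, 50 and 25, 56 and 19), and the computed statistics agree with the reported values of 8.32, 19.24, and 9.12 on 4 degrees of freedom.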
Table 4

Association Between Absolute-Valued Standardized Residuals
and Item Content on the MFRT

                                       % of Standardized Residuals
               Number          1-P                   2-P                   3-P
Content        of       SR(≤1.0)  SR(>1.0)    SR(≤1.0)  SR(>1.0)    SR(≤1.0)  SR(>1.0)
Category       Items    (n=16)    (n=59)      (n=50)    (n=25)      (n=56)    (n=19)

Following
Directions      17       41.2      58.8        82.4      17.6        88.2      11.8

Locating
Information     17       23.5      76.5        82.4      17.6        82.4      17.6

Main
Idea            12        0.0     100.0        16.7      83.3        41.7      58.3

Using
Details         17       11.8      88.2        58.8      41.2        76.5      23.5

Understanding
Forms           12       25.0      75.0        83.3      16.7        75.0      25.0

                        χ² = 8.32             χ² = 19.24            χ² = 9.12
                        d.f.=4  p=.082        d.f.=4  p=.00         d.f.=4  p=.058



Table 5 provides the results from another analysis of the SRs. This
time, the average absolute-valued SRs were sorted by "easy" and "hard"
items and reported for each model. Again, the improved fit obtained with
the more general models was evident. The 2-P results were substantially
better than the 1-P results regardless of the item difficulty levels. With
the hard items, there was also a slight reduction in the SRs through the
use of the three-parameter model. Since many of the so-called "hard items"
were still relatively easy (p's > .50) it was not surprising to observe the
small impact of the "pseudo-chance level" parameter in the three-parameter
model.



Table 5

Association Between Absolute-Valued Standardized Residuals
and Item Difficulties for the MFRT

                                                Results
Difficulty       Standardized        1-P           2-P           3-P
Level            Residual          N     %       N     %       N     %

Hard (p<.75)     SR(≤1.0)          1    1.3     11   14.7     15   20.0
                 SR(>1.0)         25   33.3     15   20.0     11   14.7

Easy (p>.75)     SR(≤1.0)         15   20.0     39   52.0     41   54.7
                 SR(>1.0)         34   45.3     10   13.3      8   10.7

                                 χ² = 5.74     χ² = 9.01     χ² = 4.76
                                 d.f.=1        d.f.=1        d.f.=1
                                 p=.017        p=.003        p=.029
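
The chi-square values reported with Table 5 appear to correspond to the 2 x 2 Pearson statistic with Yates' continuity correction. That is an inference from the numbers rather than something stated in the paper, and the sketch below (with a hypothetical function name) shows the computation on the item counts in the table.

```python
def yates_chi_square(a, b, c, d):
    """Chi-square with Yates' continuity correction for the 2 x 2 table

        a  b
        c  d
    """
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Rows: hard items, easy items.
# Columns: items with SR <= 1.0, items with SR > 1.0.
tables = {
    "1-P": (1, 25, 15, 34),
    "2-P": (11, 15, 39, 10),
    "3-P": (15, 11, 41, 8),
}
for model, cells in tables.items():
    print(model, round(yates_chi_square(*cells), 2))
```

The results agree with the reported 5.74, 9.01, and 4.76 to within about .01. Without the continuity correction the one-parameter value would be noticeably larger (about 7.25), which is why the corrected form is assumed here.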

Table 6 provides another breakdown of the SRs. The results show that
(1) the easy items were fit better by the item response models than the
hard items, (2) the two- and three-parameter models fit the data in a
similar fashion and both models fit the data better than the one-parameter
model, and (3) the biggest improvements in fit through the use of the two-
and three-parameter models were obtained with the harder test items.

In a final analysis, a close study of the relationships between SRs
and biserial correlations was carried out since earlier analyses revealed
improvements resulting from the addition of a discrimination parameter to
the one-parameter model. The results in Table 7 and Figures 1 and 2 show
dramatically the impact of the use of an item discrimination parameter in
the chosen item response model. Items with low or high biserial
correlations were not fit as well by the one-parameter model as the other
two models. For example, the curvilinear relationship so apparent in
Figure 1 vanished when the two-parameter model was fit to the test data.



Table 6

Statistical Analysis of the Absolute-Valued
Standardized Residuals for the MFRT

                  Number                     Results
Difficulty        of            1-P            2-P            3-P
Level             Items      Mean    SD     Mean    SD     Mean    SD

Hard (p<.75)       26        2.07   1.15    1.15    .40    1.01    .25
Easy (p>.75)       49        1.37    .62     .86    .25     .83     --

Note. -- marks an entry that is illegible in the source copy.



Table 7

Relationship Between Item Biserial Correlations
and Standardized Residuals for the MFRT

Logistic    Standardized           Item Biserial Correlation
Model       Residual          .00 to .50    .51 to .70    .71 to 1.00
                                 (20)          (41)          (14)

1-P         0.00 to 1.00          0.0          34.1          14.3
            1.01 to 2.00         45.0          65.9          35.7
            over 2.00            55.0           0.0          50.0

            χ² = 31.74   d.f.=4   p=.000
            Eta = .608

2-P         0.00 to 1.00         65.0          61.0          85.7
            1.01 to 2.00         35.0          39.0          14.3
            over 2.00             0.0           0.0           0.0

            χ² = 2.91   d.f.=2   p=.234
            Eta = .197

3-P         0.00 to 1.00         70.0          73.2          85.7
            1.01 to 2.00         30.0          26.8          14.3
            over 2.00             0.0           0.0           0.0

            χ² = 1.18   d.f.=2   p=.554
            Eta = .126



Figure 1. Plot of item absolute-valued standardized residuals obtained
with the one-parameter model versus item biserial correlations.

Figure 2. Plot of item absolute-valued standardized residuals obtained
with the two-parameter model versus item biserial correlations.



Conclusion

The initial evidence addressing model-data fit seems clear: A two-
parameter logistic model can adequately account for examinee performance on
the MFRT. The one-parameter model did not handle the substantial variation
among test items in their discriminating power. This finding is somewhat
surprising since the original item pool had already been reduced somewhat
by removing items that failed to fit the one-parameter model. The three-
parameter model improved the fit only slightly because of the minimal
impact of guessing behavior on test performance.

With respect to addressing the fit between an item response model and
a set of test data for some desired application, our view is that the best
approach involves (1) designing and implementing a wide variety of
analyses, (2) interpreting the results, and (3) judgmentally determining
the appropriateness of the intended application. Analyses should include
investigations of model assumptions, the extent to which desired model
features are obtained, and comparisons between model predictions and actual
data. With respect to the latter, fitting more than one model and
comparing (for example) residuals provides information that is invaluable
in determining the usefulness of models. Of course there is no limit to
the number of investigations that can be carried out. The amount of effort
expended in collecting, analyzing, and interpreting results must be related
to the importance and nature of the intended application. In this study,
only a few of the necessary types of investigations for selecting an item
response model were carried out and so it would not be appropriate to
recommend one model over another at this time. For one, the practical
consequences of the one-parameter model misfit might be studied to
determine its significance in a state-wide test program. Still, there
seems to be sufficient evidence to warrant a recommendation that the
Maryland Department of Education give serious consideration to the
two-parameter model with their MFRT. Revising the test content so that a
one-parameter model will fit the test data, or assessing ability scores
with a one-parameter model that does not fit the data as well as a
two-parameter model, seem to be undesirable alternatives for a statewide
testing program. Utilizing a three-parameter model seems to be unnecessary
at this time with grade 9 students in view of the added complexity, cost,
and minimal advantages derived from the model with the MFRT.